Predicting Expected Monthly Income and Expected Attrition is critical to understanding how to manage talent within any organization.
“According to the U.S. Bureau of Statistics, the average turnover rate in the U.S. is about 12% to 15% annually. According to LinkedIn, an average annual worldwide employee turnover rate is 10.9%.” source - https://www.talentlyft.com/en/blog/article/242/hr-metrics-how-and-why-to-calculate-employee-turnover-rate#:~:text=According%20to%20the%20U.S.%20Bureau,above%20the%20average%20turnover%20rates.
The turnover rate will of course vary from company to company but the data highlights a key insight that a significant portion of your employee population will need to be replaced each year. When it comes to Monthly Salary, the formula is much more complicated than paying everyone based on position title. Company’s must consider items like work experience, education, and a variety of other factors to determine fair compensation. Utilizing machine learning algorithms, we will use the relevant data provided to predict these two factors for Frito Lay.
Frito Lay has provided a detailed data set including 36 variables on their employees. The requested objective from their team is to provide any general insights found and determine a model to predict Attrition and another model to predict Monthly Income. Our team at DDSAnalytics have broken down the data to find the key factors driving these metrics. We then leveraged these factors using machine learning algorithms like Naive Bayes and Linear Regression to predict each of the requested variables.
#Load required libraries
library(stringr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(naniar)
library(ggplot2)
library(ggthemes)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ tibble 3.0.4 ✓ purrr 0.3.4
## ✓ tidyr 1.1.2 ✓ forcats 0.5.0
## ✓ readr 1.4.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(corrplot)
## corrplot 0.84 loaded
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
library(class)
library(caret)
##
## Attaching package: 'caret'
## The following object is masked from 'package:survival':
##
## cluster
## The following object is masked from 'package:purrr':
##
## lift
library(e1071)
##
## Attaching package: 'e1071'
## The following object is masked from 'package:Hmisc':
##
## impute
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
#Read in data and remove irrelevant columns
ee=read.csv("~/Documents/SMU_DS/Doing Data Science/Case_Study_2_DDS/Raw_Data_Files/CaseStudy2-data.csv",header=TRUE,stringsAsFactors = TRUE,na.strings = c("","NA"))
str(ee)
## 'data.frame': 870 obs. of 36 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Age : int 32 40 35 32 24 27 41 37 34 34 ...
## $ Attrition : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ BusinessTravel : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 3 2 3 2 2 3 3 3 2 ...
## $ DailyRate : int 117 1308 200 801 567 294 1283 309 1333 653 ...
## $ Department : Factor w/ 3 levels "Human Resources",..: 3 2 2 3 2 2 2 3 3 2 ...
## $ DistanceFromHome : int 13 14 18 1 2 10 5 10 10 10 ...
## $ Education : int 4 3 2 4 1 2 5 4 4 4 ...
## $ EducationField : Factor w/ 6 levels "Human Resources",..: 2 4 2 3 6 2 4 2 2 6 ...
## $ EmployeeCount : int 1 1 1 1 1 1 1 1 1 1 ...
## $ EmployeeNumber : int 859 1128 1412 2016 1646 733 1448 1105 1055 1597 ...
## $ EnvironmentSatisfaction : int 2 3 3 3 1 4 2 4 3 4 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 2 2 2 1 1 2 2 1 1 2 ...
## $ HourlyRate : int 73 44 60 48 32 32 90 88 87 92 ...
## $ JobInvolvement : int 3 2 3 3 3 3 4 2 3 2 ...
## $ JobLevel : int 2 5 3 3 1 3 1 2 1 2 ...
## $ JobRole : Factor w/ 9 levels "Healthcare Representative",..: 8 6 5 8 7 5 7 8 9 1 ...
## $ JobSatisfaction : int 4 3 4 4 4 1 3 4 3 3 ...
## $ MaritalStatus : Factor w/ 3 levels "Divorced","Married",..: 1 3 3 2 3 1 2 1 2 2 ...
## $ MonthlyIncome : int 4403 19626 9362 10422 3760 8793 2127 6694 2220 5063 ...
## $ MonthlyRate : int 9250 17544 19944 24032 17218 4809 5561 24223 18410 15332 ...
## $ NumCompaniesWorked : int 2 1 2 1 1 1 2 2 1 1 ...
## $ Over18 : Factor w/ 1 level "Y": 1 1 1 1 1 1 1 1 1 1 ...
## $ OverTime : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 1 2 2 2 1 ...
## $ PercentSalaryHike : int 11 14 11 19 13 21 12 14 19 14 ...
## $ PerformanceRating : int 3 3 3 3 3 4 3 3 3 3 ...
## $ RelationshipSatisfaction: int 3 1 3 3 3 3 1 3 4 2 ...
## $ StandardHours : int 80 80 80 80 80 80 80 80 80 80 ...
## $ StockOptionLevel : int 1 0 0 2 0 2 0 3 1 1 ...
## $ TotalWorkingYears : int 8 21 10 14 6 9 7 8 1 8 ...
## $ TrainingTimesLastYear : int 3 2 2 3 2 4 5 5 2 3 ...
## $ WorkLifeBalance : int 2 4 3 3 3 2 2 3 3 2 ...
## $ YearsAtCompany : int 5 20 2 14 6 9 4 1 1 8 ...
## $ YearsInCurrentRole : int 2 7 2 10 3 7 2 0 1 2 ...
## $ YearsSinceLastPromotion : int 0 4 2 5 1 1 0 0 0 7 ...
## $ YearsWithCurrManager : int 3 9 2 7 3 7 3 0 0 7 ...
ee=ee %>% select(-EmployeeCount,-StandardHours,-Over18,-EmployeeNumber)
#Where applicable, create alternative numerical version of the factor
summary(ee$PerformanceRating)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 3.000 3.000 3.152 3.000 4.000
ee$fctrPerformanceRating=ifelse(ee$PerformanceRating<=3,'Low','High')
ee$fctrPerformanceRating=as.factor(ee$PerformanceRating)
ee$WorkLifeBalance_fctr
## NULL
ee=ee %>% mutate(DistanceFromHome_cat=ifelse(DistanceFromHome>19,"Very Far", ifelse(DistanceFromHome>9,"Far", ifelse(DistanceFromHome>3,"Close","Very Close"))))
ee$DistanceFromHome_cat=factor(ee$DistanceFromHome_cat)
ee=ee %>% mutate(numBusinessTravel=ifelse(BusinessTravel=="Non-Travel",0,
ifelse(BusinessTravel=="Travel_Rarely",1,
ifelse(BusinessTravel=="Travel_Frequently",2,3
))))
ee$numDepartment=ifelse(ee$Department=="Research & Development",0,1)
ee$numGender=ifelse(ee$Gender=="Female",0,1)
ee=ee %>% mutate(numMaritalStatus=ifelse(MaritalStatus=="Single",0,
ifelse(MaritalStatus=="Married",1,
ifelse(MaritalStatus=="Divorced",2,3
))))
ee$numOvertime=ifelse(ee$OverTime=="No",0,1)
ee$numAttrition=ifelse(ee$Attrition=="No",0,1)
#Look at scatterplot of all variables
test=ee %>% filter(Attrition=="Yes")
ee_num=ee %>% select_if(is.numeric)
ee_fctr=ee %>% select_if(is.factor)
ggplot(gather(ee_num), aes(value)) +
geom_histogram(bins = 15) +
facet_wrap(~key, scales = 'free_x')
#Look at linear model of JobSatisfaction and Monthly Income
ee %>% ggplot(aes(JobSatisfaction,MonthlyIncome,col=JobRole))+geom_smooth(method="lm")+ggtitle("Monthly Income vs. Job Satisfaction by Role")
## `geom_smooth()` using formula 'y ~ x'
#Look at specific values of Job Satifaction against Job Role
sat=ee %>% group_by(JobRole) %>% summarise(mean=mean(JobSatisfaction),median=median(JobSatisfaction))
## `summarise()` ungrouping output (override with `.groups` argument)
sat=data.frame(sat)
sat %>% arrange(desc(mean))
## JobRole mean median
## 1 Healthcare Representative 2.828947 3
## 2 Research Scientist 2.802326 3
## 3 Sales Executive 2.725000 3
## 4 Manufacturing Director 2.724138 3
## 5 Sales Representative 2.698113 3
## 6 Laboratory Technician 2.686275 3
## 7 Human Resources 2.555556 3
## 8 Manager 2.509804 2
## 9 Research Director 2.490196 3
sat
## JobRole mean median
## 1 Healthcare Representative 2.828947 3
## 2 Human Resources 2.555556 3
## 3 Laboratory Technician 2.686275 3
## 4 Manager 2.509804 2
## 5 Manufacturing Director 2.724138 3
## 6 Research Director 2.490196 3
## 7 Research Scientist 2.802326 3
## 8 Sales Executive 2.725000 3
## 9 Sales Representative 2.698113 3
#Visualize Job Satifaction against Job Role
ee %>% ggplot(aes(JobSatisfaction,reorder(JobRole,JobSatisfaction)))+geom_boxplot(fill='deepskyblue1',)+ggtitle("Job Satisfaction by Role")+theme(text = element_text(size=28))+stat_summary(fun.y=mean, geom="point", shape=23, size=5, color="red", fill="red")+xlab("Job Role")+ylab("Job Satisfaction")
## Warning: `fun.y` is deprecated. Use `fun` instead.
#Visualize Monthly Income against Job Role
ee %>% ggplot(aes(MonthlyIncome,reorder(JobRole,MonthlyIncome)))+geom_boxplot(fill='deepskyblue1',)+ggtitle("Monthly Income by Role")+theme(text = element_text(size=28))+stat_summary(fun.y=mean, geom="point", shape=23, size=3, color="red", fill="red")+xlab("Job Role")+ylab("Monthly Income")
## Warning: `fun.y` is deprecated. Use `fun` instead.
We have now created a numerical variable alternatives for each given variable where the variables can be converted based on category and ordinal placement. Also, we visualized each of the distributions for the variables. We can see many of the distributions are skewed. We also visualized Job Satisfaction by Job Role and Monthly income by Job Role. Each plot is organized by mean value.
#KNN Analysis on Attrition
#Test variables with Wicox Test for significance
vars_num=variable.names(ee_num[,-1])
p_matrix= matrix(nrow=length(ee_num)-1, ncol=3)
for (i in 1:length(vars_num)){
a=wilcox.test(ee_num[,(i+1)]~ee$Attrition,alternative="two.sided")
p_matrix[i,1]=vars_num[i]
p_matrix[i,2]=a[[3]]
p_matrix[i,3]=ifelse(a[[3]]<=.0017,"keep","remove")
}
p_matrix
## [,1] [,2] [,3]
## [1,] "Age" "2.05142624461662e-06" "keep"
## [2,] "DailyRate" "0.299457401638044" "remove"
## [3,] "DistanceFromHome" "0.0272542867612612" "remove"
## [4,] "Education" "0.132737345594014" "remove"
## [5,] "EnvironmentSatisfaction" "0.0384660720848109" "remove"
## [6,] "HourlyRate" "0.29009357195183" "remove"
## [7,] "JobInvolvement" "7.73304810764929e-07" "keep"
## [8,] "JobLevel" "2.49503721487344e-08" "keep"
## [9,] "JobSatisfaction" "0.00118261722782567" "keep"
## [10,] "MonthlyIncome" "4.07351265527439e-09" "keep"
## [11,] "MonthlyRate" "0.208563502820914" "remove"
## [12,] "NumCompaniesWorked" "0.17226278333996" "remove"
## [13,] "PercentSalaryHike" "0.975981196105236" "remove"
## [14,] "PerformanceRating" "0.651465117867211" "remove"
## [15,] "RelationshipSatisfaction" "0.29777866495085" "remove"
## [16,] "StockOptionLevel" "3.0761263616543e-09" "keep"
## [17,] "TotalWorkingYears" "4.04234499349295e-09" "keep"
## [18,] "TrainingTimesLastYear" "0.0922434095280687" "remove"
## [19,] "WorkLifeBalance" "0.034430719605193" "remove"
## [20,] "YearsAtCompany" "3.10980498624275e-08" "keep"
## [21,] "YearsInCurrentRole" "9.48224070619458e-08" "keep"
## [22,] "YearsSinceLastPromotion" "0.368074119349138" "remove"
## [23,] "YearsWithCurrManager" "9.34684457405231e-07" "keep"
## [24,] "numBusinessTravel" "0.0168665223239186" "remove"
## [25,] "numDepartment" "0.00291697843641116" "remove"
## [26,] "numGender" "0.456795406955404" "remove"
## [27,] "numMaritalStatus" "5.35518941999522e-09" "keep"
## [28,] "numOvertime" "1.06539927859169e-15" "keep"
## [29,] "numAttrition" "5.42845762548052e-191" "keep"
p_vals=data.frame(p_matrix)
colnames(p_vals)=c("var","p_value","sig")
p_keep=p_vals %>% filter(sig=="keep")
p_keep$p_value=as.numeric(p_keep$p_value)
p_keep=p_keep %>% arrange(p_value)
p_keep %>% filter(var!='numAttrition')
## var p_value sig
## 1 numOvertime 1.065399e-15 keep
## 2 StockOptionLevel 3.076126e-09 keep
## 3 TotalWorkingYears 4.042345e-09 keep
## 4 MonthlyIncome 4.073513e-09 keep
## 5 numMaritalStatus 5.355189e-09 keep
## 6 JobLevel 2.495037e-08 keep
## 7 YearsAtCompany 3.109805e-08 keep
## 8 YearsInCurrentRole 9.482241e-08 keep
## 9 JobInvolvement 7.733048e-07 keep
## 10 YearsWithCurrManager 9.346845e-07 keep
## 11 Age 2.051426e-06 keep
## 12 JobSatisfaction 1.182617e-03 keep
p_keep_vars=p_keep[,1]
scaled_vars=ee_num %>% select(all_of(p_keep_vars))
#Create Over-Sample to ballance Attrition
holder=scaled_vars
for(f in 1:10){
set.seed(f)
att_rows=holder[holder$Attrition==1,]
extrarows= sample(1:dim(att_rows)[1],round(.42143*dim(att_rows)[1]))
add_df=att_rows[extrarows,]
scaled_vars=rbind(scaled_vars,add_df)
}
str(scaled_vars)
## 'data.frame': 870 obs. of 13 variables:
## $ numAttrition : num 0 0 0 0 0 0 0 0 0 0 ...
## $ numOvertime : num 0 0 0 0 1 0 1 1 1 0 ...
## $ StockOptionLevel : int 1 0 0 2 0 2 0 3 1 1 ...
## $ TotalWorkingYears : int 8 21 10 14 6 9 7 8 1 8 ...
## $ MonthlyIncome : int 4403 19626 9362 10422 3760 8793 2127 6694 2220 5063 ...
## $ numMaritalStatus : num 2 0 0 1 0 2 1 2 1 1 ...
## $ JobLevel : int 2 5 3 3 1 3 1 2 1 2 ...
## $ YearsAtCompany : int 5 20 2 14 6 9 4 1 1 8 ...
## $ YearsInCurrentRole : int 2 7 2 10 3 7 2 0 1 2 ...
## $ JobInvolvement : int 3 2 3 3 3 3 4 2 3 2 ...
## $ YearsWithCurrManager: int 3 9 2 7 3 7 3 0 0 7 ...
## $ Age : int 32 40 35 32 24 27 41 37 34 34 ...
## $ JobSatisfaction : int 4 3 4 4 4 1 3 4 3 3 ...
str(ee_num)
## 'data.frame': 870 obs. of 30 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Age : int 32 40 35 32 24 27 41 37 34 34 ...
## $ DailyRate : int 117 1308 200 801 567 294 1283 309 1333 653 ...
## $ DistanceFromHome : int 13 14 18 1 2 10 5 10 10 10 ...
## $ Education : int 4 3 2 4 1 2 5 4 4 4 ...
## $ EnvironmentSatisfaction : int 2 3 3 3 1 4 2 4 3 4 ...
## $ HourlyRate : int 73 44 60 48 32 32 90 88 87 92 ...
## $ JobInvolvement : int 3 2 3 3 3 3 4 2 3 2 ...
## $ JobLevel : int 2 5 3 3 1 3 1 2 1 2 ...
## $ JobSatisfaction : int 4 3 4 4 4 1 3 4 3 3 ...
## $ MonthlyIncome : int 4403 19626 9362 10422 3760 8793 2127 6694 2220 5063 ...
## $ MonthlyRate : int 9250 17544 19944 24032 17218 4809 5561 24223 18410 15332 ...
## $ NumCompaniesWorked : int 2 1 2 1 1 1 2 2 1 1 ...
## $ PercentSalaryHike : int 11 14 11 19 13 21 12 14 19 14 ...
## $ PerformanceRating : int 3 3 3 3 3 4 3 3 3 3 ...
## $ RelationshipSatisfaction: int 3 1 3 3 3 3 1 3 4 2 ...
## $ StockOptionLevel : int 1 0 0 2 0 2 0 3 1 1 ...
## $ TotalWorkingYears : int 8 21 10 14 6 9 7 8 1 8 ...
## $ TrainingTimesLastYear : int 3 2 2 3 2 4 5 5 2 3 ...
## $ WorkLifeBalance : int 2 4 3 3 3 2 2 3 3 2 ...
## $ YearsAtCompany : int 5 20 2 14 6 9 4 1 1 8 ...
## $ YearsInCurrentRole : int 2 7 2 10 3 7 2 0 1 2 ...
## $ YearsSinceLastPromotion : int 0 4 2 5 1 1 0 0 0 7 ...
## $ YearsWithCurrManager : int 3 9 2 7 3 7 3 0 0 7 ...
## $ numBusinessTravel : num 1 1 2 1 2 2 1 1 1 2 ...
## $ numDepartment : num 1 0 0 1 0 0 0 1 1 0 ...
## $ numGender : num 1 1 1 0 0 1 1 0 0 1 ...
## $ numMaritalStatus : num 2 0 0 1 0 2 1 2 1 1 ...
## $ numOvertime : num 0 0 0 0 1 0 1 1 1 0 ...
## $ numAttrition : num 0 0 0 0 0 0 0 0 0 0 ...
#Scale Variables and Test KNN across K values
outcome_data= scaled_vars$numAttrition
scaled_vars=scaled_vars %>% select(-numAttrition)
scaled_vars=data.frame(scale(scaled_vars))
train_cols= scaled_vars
num_rand_samples= 15
max_k= 30
accuracy_matrix= matrix(nrow= num_rand_samples, ncol=max_k)
sensitivity_matrix= matrix(nrow= num_rand_samples, ncol=max_k)
specificity_matrix= matrix(nrow= num_rand_samples, ncol=max_k)
for(i in 1:num_rand_samples)
{
set.seed(i)
for(j in 1:max_k)
{
class_data= knn.cv(train_cols,cl = outcome_data, k=j)
CM= confusionMatrix(table(class_data,outcome_data))
accuracy_matrix[i,j]= CM$overall[1]
sensitivity_matrix[i,j]= CM$byClass[1]
specificity_matrix[i,j]= CM$byClass[2]
}
}
MeanAcc = colMeans(accuracy_matrix)
MeanSens = colMeans(sensitivity_matrix)
MeanSpec= colMeans(specificity_matrix)
which(MeanAcc==max(MeanAcc))
## [1] 5 9
which(MeanSens==max(MeanSens))
## [1] 30
which(MeanSpec==max(MeanSpec))
## [1] 2
max(MeanAcc)
## [1] 0.8655172
max(MeanSens)
## [1] 0.9866667
max(MeanSpec)
## [1] 0.4004762
par(mfrow=c(1,1))
#Viualize results
plot(x = seq(1,max_k,1),y = MeanAcc, ylim=c(0,1),type = "l", main= "KNN Classification of Attrition based on Selected Variables", xlab= "k Value", ylab= "Mean Acc (black) / Mean Spec (blue) / Mean Sens (red)",cex.lab=1.5, cex.axis=1.5, cex.main=1.5, cex.sub=1.5)
lines(x = seq(1,max_k,1),y = MeanSens, ylim=c(0,1),type = "l", col="red")
lines(x = seq(1,max_k,1),y = MeanSpec, ylim=c(0,1),type = "l", col="blue")
summary(ee$Attrition)
## No Yes
## 730 140
To counteract the sample imbalance in Attrition in all upcoming models, we will use an over-sample technique and randomly draw from the ‘Yes’ Attrition values until the sample is balanced. KNN utilized numerical values only and requires them to be scaled. The variables were selected with a Wilcox Rank Sum Test to predict Attrition and the p-value criteria of 0.0017. The result as plotted shows the highest sensitivity KNN can achieve is approximately 0.4 which is below the requirement. We need a different algorithm.
#Naive Bayes Analysis on Attrition
#Deterimine Variables using Chi-Squared Test
ee_nb_options=ee[,-c(1,3,34:40)]
vars2=variable.names(ee[,-c(1,3,34:40)])
p_matrix= matrix(nrow=length(vars2), ncol=3)
for (i in 1:length(vars2)){
a=chisq.test(ee$Attrition,ee_nb_options[,(i)])
p_matrix[i,1]=vars2[i]
p_matrix[i,2]=a$p.value
p_matrix[i,3]=ifelse(a$p.value<=.0017,"keep","remove")
}
## Warning in chisq.test(ee$Attrition, ee_nb_options[, (i)]): Chi-squared
## approximation may be incorrect
## Warning in chisq.test(ee$Attrition, ee_nb_options[, (i)]): Chi-squared
## approximation may be incorrect
## Warning in chisq.test(ee$Attrition, ee_nb_options[, (i)]): Chi-squared
## approximation may be incorrect
## Warning in chisq.test(ee$Attrition, ee_nb_options[, (i)]): Chi-squared
## approximation may be incorrect
## Warning in chisq.test(ee$Attrition, ee_nb_options[, (i)]): Chi-squared
## approximation may be incorrect
## Warning in chisq.test(ee$Attrition, ee_nb_options[, (i)]): Chi-squared
## approximation may be incorrect
## Warning in chisq.test(ee$Attrition, ee_nb_options[, (i)]): Chi-squared
## approximation may be incorrect
## Warning in chisq.test(ee$Attrition, ee_nb_options[, (i)]): Chi-squared
## approximation may be incorrect
## Warning in chisq.test(ee$Attrition, ee_nb_options[, (i)]): Chi-squared
## approximation may be incorrect
## Warning in chisq.test(ee$Attrition, ee_nb_options[, (i)]): Chi-squared
## approximation may be incorrect
## Warning in chisq.test(ee$Attrition, ee_nb_options[, (i)]): Chi-squared
## approximation may be incorrect
## Warning in chisq.test(ee$Attrition, ee_nb_options[, (i)]): Chi-squared
## approximation may be incorrect
## Warning in chisq.test(ee$Attrition, ee_nb_options[, (i)]): Chi-squared
## approximation may be incorrect
## Warning in chisq.test(ee$Attrition, ee_nb_options[, (i)]): Chi-squared
## approximation may be incorrect
## Warning in chisq.test(ee$Attrition, ee_nb_options[, (i)]): Chi-squared
## approximation may be incorrect
## Warning in chisq.test(ee$Attrition, ee_nb_options[, (i)]): Chi-squared
## approximation may be incorrect
## Warning in chisq.test(ee$Attrition, ee_nb_options[, (i)]): Chi-squared
## approximation may be incorrect
p_matrix
## [,1] [,2] [,3]
## [1,] "Age" "0.000455925966956391" "keep"
## [2,] "BusinessTravel" "0.0499253583464617" "remove"
## [3,] "DailyRate" "0.394359650210051" "remove"
## [4,] "Department" "0.00942377313351953" "remove"
## [5,] "DistanceFromHome" "0.041015227721121" "remove"
## [6,] "Education" "0.624283776640735" "remove"
## [7,] "EducationField" "0.268219767928874" "remove"
## [8,] "EnvironmentSatisfaction" "0.0105408976789624" "remove"
## [9,] "Gender" "0.515129674660433" "remove"
## [10,] "HourlyRate" "0.165336227368764" "remove"
## [11,] "JobInvolvement" "5.21104090021876e-09" "keep"
## [12,] "JobLevel" "2.08470324574323e-08" "keep"
## [13,] "JobRole" "3.64683624511309e-10" "keep"
## [14,] "JobSatisfaction" "0.0111512175494692" "remove"
## [15,] "MaritalStatus" "3.37894621259824e-08" "keep"
## [16,] "MonthlyIncome" "0.548532282023289" "remove"
## [17,] "MonthlyRate" "0.536630277147251" "remove"
## [18,] "NumCompaniesWorked" "0.0167775870740727" "remove"
## [19,] "OverTime" "2.33298071371539e-15" "keep"
## [20,] "PercentSalaryHike" "0.277627401358292" "remove"
## [21,] "PerformanceRating" "0.746170559364242" "remove"
## [22,] "RelationshipSatisfaction" "0.372711655350202" "remove"
## [23,] "StockOptionLevel" "3.7244644710761e-12" "keep"
## [24,] "TotalWorkingYears" "0.000407183992741349" "keep"
## [25,] "TrainingTimesLastYear" "0.119204353121313" "remove"
## [26,] "WorkLifeBalance" "0.00249508979751358" "remove"
## [27,] "YearsAtCompany" "7.89367480428311e-05" "keep"
## [28,] "YearsInCurrentRole" "0.000792886613459846" "keep"
## [29,] "YearsSinceLastPromotion" "0.129359276246726" "remove"
## [30,] "YearsWithCurrManager" "0.000258107721542933" "keep"
## [31,] "fctrPerformanceRating" "0.746170559364242" "remove"
p_vals=data.frame(p_matrix)
colnames(p_vals)=c("var","p_value","sig")
p_keep=p_vals %>% filter(sig=="keep")
p_keep$p_value=as.numeric(p_keep$p_value)
p_keep=p_keep %>% arrange(p_value)
p_keep
## var p_value sig
## 1 OverTime 2.332981e-15 keep
## 2 StockOptionLevel 3.724464e-12 keep
## 3 JobRole 3.646836e-10 keep
## 4 JobInvolvement 5.211041e-09 keep
## 5 JobLevel 2.084703e-08 keep
## 6 MaritalStatus 3.378946e-08 keep
## 7 YearsAtCompany 7.893675e-05 keep
## 8 YearsWithCurrManager 2.581077e-04 keep
## 9 TotalWorkingYears 4.071840e-04 keep
## 10 Age 4.559260e-04 keep
## 11 YearsInCurrentRole 7.928866e-04 keep
keep_len=dim(p_keep)[1]
MeanAcc_nb=matrix(nrow= keep_len)
MeanSens_nb=matrix(nrow= keep_len)
MeanSpec_nb=matrix(nrow= keep_len)
#Big loop to loop through adding a parameter by significance
for(n in 1:keep_len){
p_keep_select=head(p_keep,n)
p_keep_vars=p_keep_select[,1]
ee_nb=ee %>% select(all_of(p_keep_vars))
ee_nb_num=ee_nb%>% select_if(is.numeric)
ee_nb_fctr=ee_nb%>% select_if(is.factor)
ee_nb_num=scale(ee_nb_num)
ee_nb_w_response=cbind(ee_nb_num,ee_nb_fctr,Attrition=ee$Attrition)
ee_nb_final=ee_nb_w_response
#Loop to create Over-Sample to balance attrition
for(f in 1:10){
set.seed(f)
att_rows=ee_nb_final[ee_nb_final$Attrition=="Yes",]
extrarows= sample(1:dim(att_rows)[1],round(.42143*dim(att_rows)[1]))
add_df=att_rows[extrarows,]
ee_nb_w_response=rbind(ee_nb_w_response,add_df)
}
#Loop to test Naive Bayes over multiple iterations
iterations= 20
masterAcc=matrix(nrow= iterations)
masterSens=matrix(nrow=iterations)
masterSpec=matrix(nrow=iterations)
splitPerc= .7 #training/test split percentage
for(j in 1:iterations)
{
set.seed(j)
trainIndices= sample(1:dim(ee_nb_w_response)[1],round(splitPerc*dim(ee_nb_w_response)[1]))
train= ee_nb_w_response[trainIndices,]
test=ee_nb_w_response[-trainIndices,]
train_columns_split_model= train[,-(n+1)]
test_columns= test[,-(n+1)]
model= naiveBayes(train_columns_split_model,train$Attrition)
CM= confusionMatrix(table(predict(model,test_columns), test$Attrition))
masterAcc[j]=CM$overall[1]
masterSens[j]=CM$byClass[1]
masterSpec[j]=CM$byClass[2]
}
MeanAcc_nb[n]=colMeans(masterAcc)
MeanSens_nb[n]=colMeans(masterSens)
MeanSpec_nb[n]= colMeans(masterSpec)
}
MeanAcc_nb
## [,1]
## [1,] 0.4872146
## [2,] 0.6675799
## [3,] 0.6985160
## [4,] 0.7281963
## [5,] 0.7143836
## [6,] 0.7232877
## [7,] 0.7281963
## [8,] 0.7237443
## [9,] 0.7214612
## [10,] 0.7147260
## [11,] 0.7012557
MeanSens_nb
## [,1]
## [1,] 0.6500000
## [2,] 0.7678776
## [3,] 0.7676154
## [4,] 0.7296704
## [5,] 0.7127203
## [6,] 0.6960368
## [7,] 0.7050290
## [8,] 0.7064496
## [9,] 0.6880368
## [10,] 0.6827533
## [11,] 0.6662602
MeanSpec_nb
## [,1]
## [1,] 0.3500000
## [2,] 0.5682520
## [3,] 0.6307902
## [4,] 0.7274035
## [5,] 0.7165578
## [6,] 0.7506261
## [7,] 0.7516524
## [8,] 0.7413414
## [9,] 0.7548895
## [10,] 0.7466924
## [11,] 0.7360456
m=MeanSens_nb+MeanSpec_nb
m
## [,1]
## [1,] 1.000000
## [2,] 1.336130
## [3,] 1.398406
## [4,] 1.457074
## [5,] 1.429278
## [6,] 1.446663
## [7,] 1.456681
## [8,] 1.447791
## [9,] 1.442926
## [10,] 1.429446
## [11,] 1.402306
#Visualize Results
plot(x = seq(1,length(MeanAcc_nb),1),y = MeanAcc_nb, ylim=c(0,1),type = "l", main= "NB Classification of Attrition based on Selected Variables", xlab= "Parameters", ylab= "Mean Acc (black) / Mean Spec (blue) / Mean Sens (red)",cex.lab=1.5, cex.axis=1.5, cex.main=1.5, cex.sub=1.5)
lines(x = seq(1,length(MeanSens_nb),1),y = MeanSens_nb, ylim=c(0,1),type = "l", col="red")
lines(x = seq(1,length(MeanSpec_nb),1),y = MeanSpec_nb, ylim=c(0,1),type = "l", col="blue")
dim(ee_nb_final)
## [1] 870 12
str(ee_nb_w_response)
## 'data.frame': 1460 obs. of 12 variables:
## $ StockOptionLevel : num 0.252 -0.914 -0.914 1.418 -0.914 ...
## $ JobInvolvement : num 0.394 -1.028 0.394 0.394 0.394 ...
## $ JobLevel : num -0.0358 2.7162 0.8815 0.8815 -0.9532 ...
## $ YearsAtCompany : num -0.326 2.165 -0.824 1.169 -0.16 ...
## $ YearsWithCurrManager: num -0.319 1.36 -0.599 0.8 -0.319 ...
## $ TotalWorkingYears : num -0.406 1.324 -0.14 0.392 -0.672 ...
## $ Age : num -0.541 0.355 -0.205 -0.541 -1.437 ...
## $ YearsInCurrentRole : num -0.606 0.768 -0.606 1.592 -0.331 ...
## $ OverTime : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 1 2 2 2 1 ...
## $ JobRole : Factor w/ 9 levels "Healthcare Representative",..: 8 6 5 8 7 5 7 8 9 1 ...
## $ MaritalStatus : Factor w/ 3 levels "Divorced","Married",..: 1 3 3 2 3 1 2 1 2 2 ...
## $ Attrition : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
#Test final model on train data
model=naiveBayes(ee_nb_w_response[,-c(5:12)],ee_nb_w_response$Attrition)
CM= confusionMatrix(table(predict(model,ee_nb_w_response[,-12]),
ee_nb_w_response$Attrition))
CM
## Confusion Matrix and Statistics
##
##
## No Yes
## No 502 252
## Yes 228 478
##
## Accuracy : 0.6712
## 95% CI : (0.6465, 0.6953)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.3425
##
## Mcnemar's Test P-Value : 0.2938
##
## Sensitivity : 0.6877
## Specificity : 0.6548
## Pos Pred Value : 0.6658
## Neg Pred Value : 0.6771
## Prevalence : 0.5000
## Detection Rate : 0.3438
## Detection Prevalence : 0.5164
## Balanced Accuracy : 0.6712
##
## 'Positive' Class : No
##
#Create csv of predictions on test data and store predictions on GitHub
att_test=read.csv("~/Documents/SMU_DS/Doing Data Science/Case_Study_2_DDS/Raw_Data_Files/CaseStudy2CompSet No Attrition.csv",header = TRUE,stringsAsFactors = TRUE)
ref=ee_nb_w_response %>% select(-Attrition)
att_vars=att_test %>% select(all_of(variable.names(ref)))
variable.names(ee_nb_w_response)
## [1] "StockOptionLevel" "JobInvolvement" "JobLevel"
## [4] "YearsAtCompany" "YearsWithCurrManager" "TotalWorkingYears"
## [7] "Age" "YearsInCurrentRole" "OverTime"
## [10] "JobRole" "MaritalStatus" "Attrition"
model=naiveBayes(ee_nb_w_response[,-c(5:12)],ee_nb_w_response$Attrition)
att_preds=predict(model,att_vars)
final_preds=data.frame(ID=att_test$ID,Attrition=att_preds)
summary(final_preds)
## ID Attrition
## Min. :1171 No :220
## 1st Qu.:1246 Yes: 80
## Median :1320
## Mean :1320
## 3rd Qu.:1395
## Max. :1470
summary(ee$Attrition)
## No Yes
## 730 140
summary(ee_nb_w_response$Attrition)
## No Yes
## 730 730
#write.csv(final_preds,"Case2PredictionsStewartAttrition.csv")
Naive Bayes allows categorical and continuous variables to be used. The variables were selected using a Chi-Squared Test with a p-value criteria of 0.0017. The plot tested adding one parameter at a time in the order of significance. The ideal result is using the top four parameters (Overtime, StockOptionLevel, JobRole, and JobInvolvement). They achieved a sensitivity of 0.6877 and a specificity of 0.6548.
#Select Variables
ee_rf_options=ee[,-c(1,3,34:40)]
vars2=variable.names(ee[,-c(1,3,34:40)])
p_matrix= matrix(nrow=length(vars2), ncol=3)
for (i in 1:length(vars2)){
a=chisq.test(ee$Attrition,ee_rf_options[,(i)])
p_matrix[i,1]=vars2[i]
p_matrix[i,2]=a$p.value
p_matrix[i,3]=ifelse(a$p.value<=.3,"keep","remove")
}
## Warning in chisq.test(ee$Attrition, ee_rf_options[, (i)]): Chi-squared
## approximation may be incorrect
## Warning in chisq.test(ee$Attrition, ee_rf_options[, (i)]): Chi-squared
## approximation may be incorrect
## Warning in chisq.test(ee$Attrition, ee_rf_options[, (i)]): Chi-squared
## approximation may be incorrect
## Warning in chisq.test(ee$Attrition, ee_rf_options[, (i)]): Chi-squared
## approximation may be incorrect
## Warning in chisq.test(ee$Attrition, ee_rf_options[, (i)]): Chi-squared
## approximation may be incorrect
## Warning in chisq.test(ee$Attrition, ee_rf_options[, (i)]): Chi-squared
## approximation may be incorrect
## Warning in chisq.test(ee$Attrition, ee_rf_options[, (i)]): Chi-squared
## approximation may be incorrect
## Warning in chisq.test(ee$Attrition, ee_rf_options[, (i)]): Chi-squared
## approximation may be incorrect
## Warning in chisq.test(ee$Attrition, ee_rf_options[, (i)]): Chi-squared
## approximation may be incorrect
## Warning in chisq.test(ee$Attrition, ee_rf_options[, (i)]): Chi-squared
## approximation may be incorrect
## Warning in chisq.test(ee$Attrition, ee_rf_options[, (i)]): Chi-squared
## approximation may be incorrect
## Warning in chisq.test(ee$Attrition, ee_rf_options[, (i)]): Chi-squared
## approximation may be incorrect
## Warning in chisq.test(ee$Attrition, ee_rf_options[, (i)]): Chi-squared
## approximation may be incorrect
## Warning in chisq.test(ee$Attrition, ee_rf_options[, (i)]): Chi-squared
## approximation may be incorrect
## Warning in chisq.test(ee$Attrition, ee_rf_options[, (i)]): Chi-squared
## approximation may be incorrect
## Warning in chisq.test(ee$Attrition, ee_rf_options[, (i)]): Chi-squared
## approximation may be incorrect
## Warning in chisq.test(ee$Attrition, ee_rf_options[, (i)]): Chi-squared
## approximation may be incorrect
p_matrix
## [,1] [,2] [,3]
## [1,] "Age" "0.000455925966956391" "keep"
## [2,] "BusinessTravel" "0.0499253583464617" "keep"
## [3,] "DailyRate" "0.394359650210051" "remove"
## [4,] "Department" "0.00942377313351953" "keep"
## [5,] "DistanceFromHome" "0.041015227721121" "keep"
## [6,] "Education" "0.624283776640735" "remove"
## [7,] "EducationField" "0.268219767928874" "keep"
## [8,] "EnvironmentSatisfaction" "0.0105408976789624" "keep"
## [9,] "Gender" "0.515129674660433" "remove"
## [10,] "HourlyRate" "0.165336227368764" "keep"
## [11,] "JobInvolvement" "5.21104090021876e-09" "keep"
## [12,] "JobLevel" "2.08470324574323e-08" "keep"
## [13,] "JobRole" "3.64683624511309e-10" "keep"
## [14,] "JobSatisfaction" "0.0111512175494692" "keep"
## [15,] "MaritalStatus" "3.37894621259824e-08" "keep"
## [16,] "MonthlyIncome" "0.548532282023289" "remove"
## [17,] "MonthlyRate" "0.536630277147251" "remove"
## [18,] "NumCompaniesWorked" "0.0167775870740727" "keep"
## [19,] "OverTime" "2.33298071371539e-15" "keep"
## [20,] "PercentSalaryHike" "0.277627401358292" "keep"
## [21,] "PerformanceRating" "0.746170559364242" "remove"
## [22,] "RelationshipSatisfaction" "0.372711655350202" "remove"
## [23,] "StockOptionLevel" "3.7244644710761e-12" "keep"
## [24,] "TotalWorkingYears" "0.000407183992741349" "keep"
## [25,] "TrainingTimesLastYear" "0.119204353121313" "keep"
## [26,] "WorkLifeBalance" "0.00249508979751358" "keep"
## [27,] "YearsAtCompany" "7.89367480428311e-05" "keep"
## [28,] "YearsInCurrentRole" "0.000792886613459846" "keep"
## [29,] "YearsSinceLastPromotion" "0.129359276246726" "keep"
## [30,] "YearsWithCurrManager" "0.000258107721542933" "keep"
## [31,] "fctrPerformanceRating" "0.746170559364242" "remove"
p_vals=data.frame(p_matrix)
colnames(p_vals)=c("var","p_value","sig")
p_keep=p_vals %>% filter(sig=="keep")
p_keep$p_value=as.numeric(p_keep$p_value)
p_keep=p_keep %>% arrange(p_value)
p_keep[c(3,5,6,7,8),]
## var p_value sig
## 3 JobRole 3.646836e-10 keep
## 5 JobLevel 2.084703e-08 keep
## 6 MaritalStatus 3.378946e-08 keep
## 7 YearsAtCompany 7.893675e-05 keep
## 8 YearsWithCurrManager 2.581077e-04 keep
keep_len=dim(p_keep)[1]
#p_keep_select=head(p_keep,5)
#p_keep_vars=p_keep_select[,1]
p_keep_vars=p_keep[,1]
#Test RF
ee_rf=ee %>% select(all_of(p_keep_vars))
ee_rf_num=ee_rf%>% select_if(is.numeric)
ee_rf_fctr=ee_rf%>% select_if(is.factor)
ee_rf_num=scale(ee_rf_num)
ee_rf_w_response=cbind(ee_rf_num,ee_rf_fctr,Attrition=ee$Attrition)
iterations=5
masterAcc=matrix(nrow= iterations)
masterSens=matrix(nrow=iterations)
masterSpec=matrix(nrow=iterations)
a=c()
i=5
holder=ee_rf_w_response
for(f in 1:10){
set.seed(f)
att_rows=holder[holder$Attrition==1,]
extrarows= sample(1:dim(att_rows)[1],round(.42143*dim(att_rows)[1]))
add_df=att_rows[extrarows,]
ee_rf_w_response=rbind(ee_rf_w_response,add_df)
}
for (i in 1:iterations) {
set.seed(i)
trainIndices= sample(1:dim(ee_rf_w_response)[1],round(splitPerc*dim(ee_rf_w_response)[1]))
train= ee_rf_w_response[trainIndices,]
test=ee_rf_w_response[-trainIndices,]
train_columns_split_model= train[,-dim(ee_rf_w_response)[2]]
test_columns= test[,-dim(ee_rf_w_response)[2]]
model <- randomForest(Attrition ~ ., data = train, ntree = 1000, mtry = i, importance = TRUE)
predValid <- predict(model, test_columns, type = "class")
CM= confusionMatrix(table(predValid, test$Attrition))
#a[i] = mean(predValid == test$Attrition)
masterAcc[i]=CM$overall[1]
masterSens[i]=CM$byClass[1]
masterSpec[i]=CM$byClass[2]
}
MeanAcc_rf=colMeans(masterAcc)
MeanSens_rf=colMeans(masterSens)
MeanSpec_rf= colMeans(masterSpec)
MeanAcc_rf
## [1] 0.8582375
MeanSens_rf
## [1] 0.9946218
MeanSpec_rf
## [1] 0.09697308
#Visualize Results
plot(x = seq(1,iterations,1),y = masterAcc, ylim=c(0,1),type = "l", main= "NB Classification of Attrition", xlab= "k Value", ylab= "Mean Accuracy")
lines(x = seq(1,iterations,1),y = masterSens, ylim=c(0,1),type = "l", xlab= "k Value", col="red")
lines(x = seq(1,iterations,1),y = masterSpec, ylim=c(0,1),type = "l", xlab= "k Value", col="blue")
model <- randomForest(Attrition ~ ., data = train, ntree = 1500, mtry = 7, importance = TRUE)
model$importance
## No Yes MeanDecreaseAccuracy
## StockOptionLevel 2.191667e-03 9.274061e-03 0.0032870995
## JobInvolvement 2.191694e-03 1.129934e-02 0.0036473073
## JobLevel 3.873309e-03 1.147421e-02 0.0051202063
## YearsAtCompany 6.697009e-03 1.691399e-02 0.0082618003
## YearsWithCurrManager 4.366647e-03 1.089474e-02 0.0053665764
## TotalWorkingYears 5.835737e-03 7.500813e-03 0.0061235570
## Age 3.533869e-03 1.314585e-02 0.0050255608
## YearsInCurrentRole 3.060616e-03 8.446561e-03 0.0039210255
## WorkLifeBalance 6.267695e-04 1.945619e-03 0.0008460357
## EnvironmentSatisfaction 1.047288e-03 4.763286e-03 0.0016642176
## JobSatisfaction 1.318489e-03 5.758538e-03 0.0020468737
## NumCompaniesWorked 1.269049e-03 1.149504e-03 0.0012447503
## DistanceFromHome 1.571207e-04 8.340183e-04 0.0002844521
## TrainingTimesLastYear 2.180285e-04 -4.220486e-05 0.0001858943
## YearsSinceLastPromotion 2.111187e-03 -7.843346e-04 0.0016057048
## HourlyRate -8.809999e-05 -2.463588e-03 -0.0005020419
## PercentSalaryHike 2.224664e-04 2.054763e-03 0.0005175547
## OverTime 8.779198e-03 4.181615e-02 0.0141316249
## JobRole 6.550367e-03 1.024109e-02 0.0071215637
## MaritalStatus 2.350777e-03 1.020928e-02 0.0036464525
## Department 1.049103e-03 9.644016e-04 0.0010399215
## BusinessTravel -1.289774e-04 -2.504549e-04 -0.0001374869
## EducationField 2.163567e-03 5.124491e-03 0.0026506974
## MeanDecreaseGini
## StockOptionLevel 6.740399
## JobInvolvement 8.136381
## JobLevel 5.183559
## YearsAtCompany 9.755474
## YearsWithCurrManager 6.677067
## TotalWorkingYears 9.382286
## Age 12.508573
## YearsInCurrentRole 6.069229
## WorkLifeBalance 5.312414
## EnvironmentSatisfaction 6.000749
## JobSatisfaction 6.317692
## NumCompaniesWorked 6.838441
## DistanceFromHome 8.782696
## TrainingTimesLastYear 5.341245
## YearsSinceLastPromotion 5.041925
## HourlyRate 9.994680
## PercentSalaryHike 7.749202
## OverTime 10.989015
## JobRole 12.622837
## MaritalStatus 5.020851
## Department 1.961729
## BusinessTravel 2.393429
## EducationField 8.480301
The Random Forest model was an experiment on a model I had not used before. After experimenting with different mtry values and randomTree values, the highest specificity was less than 0.2 which is far below the 0.6 criteria. This model was not usable.
#Linear Regression to Predict Income
#Look at significance and correlations
res2 <- rcorr(as.matrix(ee_num[,-c(25:30)]))
str(res2$r)
## num [1:24, 1:24] 1 -0.0401 -0.0329 0.0661 -0.0495 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:24] "ID" "Age" "DailyRate" "DistanceFromHome" ...
## ..$ : chr [1:24] "ID" "Age" "DailyRate" "DistanceFromHome" ...
corrplot(res2$r, type="upper", order="hclust",
p.mat = res2$P, sig.level = 0.1, insig = "blank")
c1=data.frame(attributes (res2$P[,11]))
c2=data.frame(res2$P[,11])
c3=data.frame(abs(res2$r[,11]))
cor_p=cbind(c1,c2,c3)
colnames(cor_p)=c("vars","p_value","r_value")
cor_p=cor_p %>% arrange(desc(r_value))
cor_p$sig=ifelse(cor_p$p_value<=.01,'sig','not sig')
cor_p
## vars p_value r_value
## MonthlyIncome MonthlyIncome NA 1.000000e+00
## JobLevel JobLevel 0.000000e+00 9.516400e-01
## TotalWorkingYears TotalWorkingYears 0.000000e+00 7.785112e-01
## YearsAtCompany YearsAtCompany 0.000000e+00 4.913790e-01
## Age Age 0.000000e+00 4.842883e-01
## YearsInCurrentRole YearsInCurrentRole 0.000000e+00 3.618405e-01
## YearsWithCurrManager YearsWithCurrManager 0.000000e+00 3.284875e-01
## YearsSinceLastPromotion YearsSinceLastPromotion 0.000000e+00 3.159116e-01
## NumCompaniesWorked NumCompaniesWorked 3.839539e-06 1.558943e-01
## Education Education 1.701095e-04 1.271314e-01
## MonthlyRate MonthlyRate 5.684494e-02 6.459407e-02
## PercentSalaryHike PercentSalaryHike 1.123575e-01 5.386594e-02
## JobSatisfaction JobSatisfaction 1.172293e-01 5.314853e-02
## ID ID 1.280196e-01 5.163903e-02
## PerformanceRating PerformanceRating 2.037486e-01 4.313100e-02
## TrainingTimesLastYear TrainingTimesLastYear 2.503303e-01 3.901457e-02
## WorkLifeBalance WorkLifeBalance 5.399821e-01 2.080489e-02
## StockOptionLevel StockOptionLevel 5.779471e-01 1.888872e-02
## EnvironmentSatisfaction EnvironmentSatisfaction 6.190955e-01 1.687744e-02
## DistanceFromHome DistanceFromHome 8.443188e-01 6.667155e-03
## RelationshipSatisfaction RelationshipSatisfaction 9.089071e-01 3.884678e-03
## HourlyRate HourlyRate 9.438534e-01 2.391151e-03
## JobInvolvement JobInvolvement 9.894867e-01 4.473792e-04
## DailyRate DailyRate 9.979342e-01 8.790339e-05
## sig
## MonthlyIncome <NA>
## JobLevel sig
## TotalWorkingYears sig
## YearsAtCompany sig
## Age sig
## YearsInCurrentRole sig
## YearsWithCurrManager sig
## YearsSinceLastPromotion sig
## NumCompaniesWorked sig
## Education sig
## MonthlyRate not sig
## PercentSalaryHike not sig
## JobSatisfaction not sig
## ID not sig
## PerformanceRating not sig
## TrainingTimesLastYear not sig
## WorkLifeBalance not sig
## StockOptionLevel not sig
## EnvironmentSatisfaction not sig
## DistanceFromHome not sig
## RelationshipSatisfaction not sig
## HourlyRate not sig
## JobInvolvement not sig
## DailyRate not sig
cor_p_select=cor_p %>% filter(sig=='sig'|vars=='MonthlyIncome' )
cor_p_select
## vars p_value r_value sig
## MonthlyIncome MonthlyIncome NA 1.0000000 <NA>
## JobLevel JobLevel 0.000000e+00 0.9516400 sig
## TotalWorkingYears TotalWorkingYears 0.000000e+00 0.7785112 sig
## YearsAtCompany YearsAtCompany 0.000000e+00 0.4913790 sig
## Age Age 0.000000e+00 0.4842883 sig
## YearsInCurrentRole YearsInCurrentRole 0.000000e+00 0.3618405 sig
## YearsWithCurrManager YearsWithCurrManager 0.000000e+00 0.3284875 sig
## YearsSinceLastPromotion YearsSinceLastPromotion 0.000000e+00 0.3159116 sig
## NumCompaniesWorked NumCompaniesWorked 3.839539e-06 0.1558943 sig
## Education Education 1.701095e-04 0.1271314 sig
cor_p_select=cor_p_select[,1]
eekeep=ee %>% select(all_of(cor_p_select))
eekeep_len=length(eekeep)-1
par(mfrow=c(4,3))
# for(i in 1:eekeep_len){
# plot(eekeep$MonthlyIncome,eekeep[,(i+1)])
# }
#Looking at untransformed scatterplots
eegath=eekeep %>%
as_data_frame() %>%
gather(key = "variable", value = "value",
-MonthlyIncome)
## Warning: `as_data_frame()` is deprecated as of tibble 2.0.0.
## Please use `as_tibble()` instead.
## The signature and semantics have changed, see `?as_tibble`.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
ggplot(eegath, aes(x = value, y = MonthlyIncome)) +
geom_point(aes()) +
geom_smooth(method='loess')+
facet_wrap(~variable,scales='free')
## `geom_smooth()` using formula 'y ~ x'
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 3
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 1
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 0
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : pseudoinverse used at 3
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : neighborhood radius 1
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : reciprocal condition
## number 0
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 0.98
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 2.02
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 3.1629e-15
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : There are other near singularities as well. 4
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : pseudoinverse used at
## 0.98
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : neighborhood radius 2.02
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : reciprocal condition
## number 3.1629e-15
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : There are other near
## singularities as well. 4
ggplot(eegath, aes(x = value, y = MonthlyIncome)) +
geom_point(aes()) +
geom_smooth(method='lm')+
facet_wrap(~variable,scales='free')
## `geom_smooth()` using formula 'y ~ x'
#Looking at log(MonthlyIncome) scatter
eekeep_logy=eekeep
eekeep_logy$MonthlyIncome=log(eekeep_logy$MonthlyIncome)
eegath2=eekeep_logy %>%
as_data_frame() %>%
gather(key = "variable", value = "value",
-MonthlyIncome)
ggplot(eegath2, aes(x = value, y = MonthlyIncome)) +
geom_point(aes()) +
geom_smooth(method='lm')+
facet_wrap(~variable,scales='free')
## `geom_smooth()` using formula 'y ~ x'
ggplot(eegath2, aes(x = value, y = MonthlyIncome)) +
geom_point(aes()) +
geom_smooth(method='loess')+
facet_wrap(~variable,scales='free')
## `geom_smooth()` using formula 'y ~ x'
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 3
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 1
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 0
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : pseudoinverse used at 3
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : neighborhood radius 1
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : reciprocal condition
## number 0
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 0.98
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 2.02
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 3.1629e-15
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : There are other near singularities as well. 4
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : pseudoinverse used at
## 0.98
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : neighborhood radius 2.02
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : reciprocal condition
## number 3.1629e-15
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : There are other near
## singularities as well. 4
#Looking at log(MonthlyIncome) and log all variables scatter
eekeep_logall=eekeep
len=length(eekeep_logall)
for(i in 1:len){
eekeep_logall[,i]=log(eekeep_logall[,i]+.01)
}
str(eekeep_logall)
## 'data.frame': 870 obs. of 10 variables:
## $ MonthlyIncome : num 8.39 9.88 9.14 9.25 8.23 ...
## $ JobLevel : num 0.69813 1.61144 1.10194 1.10194 0.00995 ...
## $ TotalWorkingYears : num 2.08 3.04 2.3 2.64 1.79 ...
## $ YearsAtCompany : num 1.611 2.996 0.698 2.64 1.793 ...
## $ Age : num 3.47 3.69 3.56 3.47 3.18 ...
## $ YearsInCurrentRole : num 0.698 1.947 0.698 2.304 1.102 ...
## $ YearsWithCurrManager : num 1.102 2.198 0.698 1.947 1.102 ...
## $ YearsSinceLastPromotion: num -4.60517 1.38879 0.69813 1.61144 0.00995 ...
## $ NumCompaniesWorked : num 0.69813 0.00995 0.69813 0.00995 0.00995 ...
## $ Education : num 1.38879 1.10194 0.69813 1.38879 0.00995 ...
eegath3=eekeep_logall %>%
as_data_frame() %>%
gather(key = "variable", value = "value",
-MonthlyIncome)
ggplot(eegath3, aes(x = value, y = MonthlyIncome)) +
geom_point(aes()) +
geom_smooth(method='lm')+
facet_wrap(~variable,scales='free')
## `geom_smooth()` using formula 'y ~ x'
#Looking at the MonthlyIncome and log of all x values
eekeep_logx=eekeep
len=length(eekeep_logx)-1
for(i in 1:len){
eekeep_logx[,i+1]=log(eekeep_logx[,i+1]+.01)
}
str(eekeep_logx)
## 'data.frame': 870 obs. of 10 variables:
## $ MonthlyIncome : int 4403 19626 9362 10422 3760 8793 2127 6694 2220 5063 ...
## $ JobLevel : num 0.69813 1.61144 1.10194 1.10194 0.00995 ...
## $ TotalWorkingYears : num 2.08 3.04 2.3 2.64 1.79 ...
## $ YearsAtCompany : num 1.611 2.996 0.698 2.64 1.793 ...
## $ Age : num 3.47 3.69 3.56 3.47 3.18 ...
## $ YearsInCurrentRole : num 0.698 1.947 0.698 2.304 1.102 ...
## $ YearsWithCurrManager : num 1.102 2.198 0.698 1.947 1.102 ...
## $ YearsSinceLastPromotion: num -4.60517 1.38879 0.69813 1.61144 0.00995 ...
## $ NumCompaniesWorked : num 0.69813 0.00995 0.69813 0.00995 0.00995 ...
## $ Education : num 1.38879 1.10194 0.69813 1.38879 0.00995 ...
eegath3=eekeep_logx %>%
as_data_frame() %>%
gather(key = "variable", value = "value",
-MonthlyIncome)
ggplot(eegath3, aes(x = value, y = MonthlyIncome)) +
geom_point(aes()) +
geom_smooth(method='lm')+
facet_wrap(~variable,scales='free')
## `geom_smooth()` using formula 'y ~ x'
ggplot(eegath3, aes(x = value, y = MonthlyIncome)) +
geom_point(aes()) +
geom_smooth(method='loess')+
facet_wrap(~variable,scales='free')
## `geom_smooth()` using formula 'y ~ x'
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 1.1019
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 0.40381
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 3.7084e-15
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : pseudoinverse used at
## 1.1019
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : neighborhood radius
## 0.40381
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : reciprocal condition
## number 3.7084e-15
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 0.0019429
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 1.1
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 5.8987e-15
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : There are other near singularities as well. 1.1924
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : pseudoinverse used at
## 0.0019429
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : neighborhood radius 1.1
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : reciprocal condition
## number 5.8987e-15
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : There are other near
## singularities as well. 1.1924
str(ee_fctr)
## 'data.frame': 870 obs. of 10 variables:
## $ Attrition : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ BusinessTravel : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 3 2 3 2 2 3 3 3 2 ...
## $ Department : Factor w/ 3 levels "Human Resources",..: 3 2 2 3 2 2 2 3 3 2 ...
## $ EducationField : Factor w/ 6 levels "Human Resources",..: 2 4 2 3 6 2 4 2 2 6 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 2 2 2 1 1 2 2 1 1 2 ...
## $ JobRole : Factor w/ 9 levels "Healthcare Representative",..: 8 6 5 8 7 5 7 8 9 1 ...
## $ MaritalStatus : Factor w/ 3 levels "Divorced","Married",..: 1 3 3 2 3 1 2 1 2 2 ...
## $ OverTime : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 1 2 2 2 1 ...
## $ fctrPerformanceRating: Factor w/ 2 levels "3","4": 1 1 1 1 1 2 1 1 1 1 ...
## $ DistanceFromHome_cat : Factor w/ 4 levels "Close","Far",..: 2 2 2 3 3 2 1 2 2 2 ...
#fit the model with selected variables
par(mfrow=c(2,2))
fit1=glm(MonthlyIncome~I(log(JobLevel)^2)+I(TotalWorkingYears^2)+I(YearsInCurrentRole^3)+I(Education^2)+log(JobLevel)+.-Age-YearsSinceLastPromotion-YearsInCurrentRole-NumCompaniesWorked-YearsWithCurrManager-Education-JobLevel,data = eekeep)
summary(fit1)
##
## Call:
## glm(formula = MonthlyIncome ~ I(log(JobLevel)^2) + I(TotalWorkingYears^2) +
## I(YearsInCurrentRole^3) + I(Education^2) + log(JobLevel) +
## . - Age - YearsSinceLastPromotion - YearsInCurrentRole -
## NumCompaniesWorked - YearsWithCurrManager - Education - JobLevel,
## data = eekeep)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5254.3 -686.4 -106.6 603.6 4716.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.278e+03 1.309e+02 17.401 < 2e-16 ***
## I(log(JobLevel)^2) 6.888e+03 2.145e+02 32.107 < 2e-16 ***
## I(TotalWorkingYears^2) -1.860e+00 6.364e-01 -2.923 0.00356 **
## I(YearsInCurrentRole^3) 1.613e-01 8.774e-02 1.838 0.06638 .
## I(Education^2) 8.148e+00 7.457e+00 1.093 0.27490
## log(JobLevel) -1.243e+03 2.720e+02 -4.572 5.54e-06 ***
## TotalWorkingYears 1.041e+02 2.170e+01 4.796 1.90e-06 ***
## YearsAtCompany -3.284e+01 1.061e+01 -3.096 0.00202 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 1560633)
##
## Null deviance: 1.8370e+10 on 869 degrees of freedom
## Residual deviance: 1.3453e+09 on 862 degrees of freedom
## AIC: 14886
##
## Number of Fisher Scoring iterations: 2
plot(fit1)
ee_nums_w_cat_options=cbind(MonthlyIncome=ee$MonthlyIncome,JobLevel=ee$JobLevel,TotalWorkingYears=ee$TotalWorkingYears,YearsInCurrentRole=ee$YearsInCurrentRole,Education=ee$Education,YearsAtCompany=ee$YearsAtCompany,ee_fctr)
fit2=glm(MonthlyIncome~I(log(JobLevel)^2)+I(TotalWorkingYears^2)+I(Education^2)+I(YearsInCurrentRole^3)+log(JobLevel)+JobRole*TotalWorkingYears+.-YearsInCurrentRole-Education-JobLevel
-Attrition-BusinessTravel-Department-EducationField-Gender-MaritalStatus-OverTime-fctrPerformanceRating-DistanceFromHome_cat
,data = ee_nums_w_cat_options,)
summary(fit2)
##
## Call:
## glm(formula = MonthlyIncome ~ I(log(JobLevel)^2) + I(TotalWorkingYears^2) +
## I(Education^2) + I(YearsInCurrentRole^3) + log(JobLevel) +
## JobRole * TotalWorkingYears + . - YearsInCurrentRole - Education -
## JobLevel - Attrition - BusinessTravel - Department - EducationField -
## Gender - MaritalStatus - OverTime - fctrPerformanceRating -
## DistanceFromHome_cat, data = ee_nums_w_cat_options)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2918.1 -602.4 -88.5 548.6 4494.9
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 2358.73104 349.41640 6.750
## I(log(JobLevel)^2) 4629.89531 242.76374 19.072
## I(TotalWorkingYears^2) -3.72633 0.70555 -5.281
## I(Education^2) -2.39343 5.97168 -0.401
## I(YearsInCurrentRole^3) 0.07745 0.07046 1.099
## log(JobLevel) -503.10157 344.61383 -1.460
## JobRoleHuman Resources 88.04394 544.48865 0.162
## JobRoleLaboratory Technician 150.61356 360.92893 0.417
## JobRoleManager 2789.93845 657.29350 4.245
## JobRoleManufacturing Director 27.26143 339.67603 0.080
## JobRoleResearch Director 3389.46125 503.65723 6.730
## JobRoleResearch Scientist -26.64747 364.37632 -0.073
## JobRoleSales Executive 230.02934 302.17781 0.761
## JobRoleSales Representative -148.27843 384.96350 -0.385
## TotalWorkingYears 197.24329 31.56686 6.248
## YearsAtCompany -7.93561 8.63699 -0.919
## JobRoleHuman Resources:TotalWorkingYears -107.48209 65.72100 -1.635
## JobRoleLaboratory Technician:TotalWorkingYears -125.94409 26.22448 -4.803
## JobRoleManager:TotalWorkingYears 19.78787 28.41280 0.696
## JobRoleManufacturing Director:TotalWorkingYears 10.11299 23.24741 0.435
## JobRoleResearch Director:TotalWorkingYears -0.22683 25.45611 -0.009
## JobRoleResearch Scientist:TotalWorkingYears -76.69482 26.59239 -2.884
## JobRoleSales Executive:TotalWorkingYears -17.68582 21.34505 -0.829
## JobRoleSales Representative:TotalWorkingYears -78.64200 36.47677 -2.156
## Pr(>|t|)
## (Intercept) 2.73e-11 ***
## I(log(JobLevel)^2) < 2e-16 ***
## I(TotalWorkingYears^2) 1.63e-07 ***
## I(Education^2) 0.68867
## I(YearsInCurrentRole^3) 0.27203
## log(JobLevel) 0.14469
## JobRoleHuman Resources 0.87158
## JobRoleLaboratory Technician 0.67657
## JobRoleManager 2.43e-05 ***
## JobRoleManufacturing Director 0.93605
## JobRoleResearch Director 3.13e-11 ***
## JobRoleResearch Scientist 0.94172
## JobRoleSales Executive 0.44673
## JobRoleSales Representative 0.70020
## TotalWorkingYears 6.56e-10 ***
## YearsAtCompany 0.35847
## JobRoleHuman Resources:TotalWorkingYears 0.10233
## JobRoleLaboratory Technician:TotalWorkingYears 1.85e-06 ***
## JobRoleManager:TotalWorkingYears 0.48634
## JobRoleManufacturing Director:TotalWorkingYears 0.66366
## JobRoleResearch Director:TotalWorkingYears 0.99289
## JobRoleResearch Scientist:TotalWorkingYears 0.00403 **
## JobRoleSales Executive:TotalWorkingYears 0.40758
## JobRoleSales Representative:TotalWorkingYears 0.03137 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 973521.6)
##
## Null deviance: 1.837e+10 on 869 degrees of freedom
## Residual deviance: 8.236e+08 on 846 degrees of freedom
## AIC: 14491
##
## Number of Fisher Scoring iterations: 2
plot(fit2)
RSS <- c(crossprod(fit2$residuals))
MSE <- RSS / length(fit2$residuals)
RMSE <- sqrt(MSE)
RMSE
## [1] 972.9676
e=ee_nums_w_cat_options
e$JobLevel=factor(e$JobLevel)
fit_alt=glm(MonthlyIncome~I(TotalWorkingYears^2)+I(Education^2)+I(YearsInCurrentRole^3)+JobRole*TotalWorkingYears+JobLevel*JobRole+.-YearsInCurrentRole-Education-Attrition-BusinessTravel-Department-EducationField-Gender-MaritalStatus-OverTime-fctrPerformanceRating-DistanceFromHome_cat
,data = e)
summary(fit_alt)
##
## Call:
## glm(formula = MonthlyIncome ~ I(TotalWorkingYears^2) + I(Education^2) +
## I(YearsInCurrentRole^3) + JobRole * TotalWorkingYears + JobLevel *
## JobRole + . - YearsInCurrentRole - Education - Attrition -
## BusinessTravel - Department - EducationField - Gender - MaritalStatus -
## OverTime - fctrPerformanceRating - DistanceFromHome_cat,
## data = e)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2912.6 -585.0 -108.0 535.9 4477.7
##
## Coefficients: (19 not defined because of singularities)
## Estimate Std. Error t value
## (Intercept) -1.849e+02 1.190e+03 -0.155
## I(TotalWorkingYears^2) -3.858e+00 8.002e-01 -4.821
## I(Education^2) -3.571e+00 6.007e+00 -0.595
## I(YearsInCurrentRole^3) 6.078e-02 7.082e-02 0.858
## JobRoleHuman Resources 2.733e+03 1.258e+03 2.173
## JobRoleLaboratory Technician 2.547e+03 1.185e+03 2.150
## JobRoleManager 3.719e+03 1.220e+03 3.048
## JobRoleManufacturing Director 1.240e+03 9.831e+02 1.261
## JobRoleResearch Director 4.544e+03 9.756e+02 4.657
## JobRoleResearch Scientist 2.618e+03 1.213e+03 2.158
## JobRoleSales Executive 1.633e+03 8.943e+02 1.826
## JobRoleSales Representative 2.394e+03 1.193e+03 2.006
## TotalWorkingYears 2.163e+02 3.587e+01 6.029
## JobLevel2 4.282e+03 1.106e+03 3.872
## JobLevel3 7.262e+03 1.071e+03 6.783
## JobLevel4 9.932e+03 1.196e+03 8.307
## JobLevel5 1.252e+04 1.245e+03 10.051
## YearsAtCompany -4.358e+00 8.763e+00 -0.497
## JobRoleHuman Resources:TotalWorkingYears -1.561e+02 7.489e+01 -2.085
## JobRoleLaboratory Technician:TotalWorkingYears -1.096e+02 3.269e+01 -3.352
## JobRoleManager:TotalWorkingYears 1.133e+01 4.012e+01 0.282
## JobRoleManufacturing Director:TotalWorkingYears -1.415e+01 3.087e+01 -0.458
## JobRoleResearch Director:TotalWorkingYears -1.974e+01 3.332e+01 -0.592
## JobRoleResearch Scientist:TotalWorkingYears -1.172e+02 3.143e+01 -3.729
## JobRoleSales Executive:TotalWorkingYears -4.557e+01 2.732e+01 -1.668
## JobRoleSales Representative:TotalWorkingYears -9.700e+01 4.041e+01 -2.400
## JobRoleHuman Resources:JobLevel2 -2.000e+03 1.360e+03 -1.471
## JobRoleLaboratory Technician:JobLevel2 -2.825e+03 1.129e+03 -2.501
## JobRoleManager:JobLevel2 NA NA NA
## JobRoleManufacturing Director:JobLevel2 -1.025e+03 7.664e+02 -1.338
## JobRoleResearch Director:JobLevel2 NA NA NA
## JobRoleResearch Scientist:JobLevel2 -2.069e+03 1.084e+03 -1.908
## JobRoleSales Executive:JobLevel2 -1.193e+03 7.069e+02 -1.688
## JobRoleSales Representative:JobLevel2 -2.367e+03 1.281e+03 -1.848
## JobRoleHuman Resources:JobLevel3 -1.556e+03 1.326e+03 -1.174
## JobRoleLaboratory Technician:JobLevel3 -3.761e+03 1.268e+03 -2.966
## JobRoleManager:JobLevel3 -6.751e+02 9.418e+02 -0.717
## JobRoleManufacturing Director:JobLevel3 -8.401e+02 7.145e+02 -1.176
## JobRoleResearch Director:JobLevel3 -8.888e+02 6.571e+02 -1.353
## JobRoleResearch Scientist:JobLevel3 NA NA NA
## JobRoleSales Executive:JobLevel3 -9.501e+02 6.700e+02 -1.418
## JobRoleSales Representative:JobLevel3 NA NA NA
## JobRoleHuman Resources:JobLevel4 NA NA NA
## JobRoleLaboratory Technician:JobLevel4 NA NA NA
## JobRoleManager:JobLevel4 -2.716e+02 4.711e+02 -0.577
## JobRoleManufacturing Director:JobLevel4 NA NA NA
## JobRoleResearch Director:JobLevel4 NA NA NA
## JobRoleResearch Scientist:JobLevel4 NA NA NA
## JobRoleSales Executive:JobLevel4 NA NA NA
## JobRoleSales Representative:JobLevel4 NA NA NA
## JobRoleHuman Resources:JobLevel5 NA NA NA
## JobRoleLaboratory Technician:JobLevel5 NA NA NA
## JobRoleManager:JobLevel5 NA NA NA
## JobRoleManufacturing Director:JobLevel5 NA NA NA
## JobRoleResearch Director:JobLevel5 NA NA NA
## JobRoleResearch Scientist:JobLevel5 NA NA NA
## JobRoleSales Executive:JobLevel5 NA NA NA
## JobRoleSales Representative:JobLevel5 NA NA NA
## Pr(>|t|)
## (Intercept) 0.876585
## I(TotalWorkingYears^2) 1.70e-06 ***
## I(Education^2) 0.552324
## I(YearsInCurrentRole^3) 0.391057
## JobRoleHuman Resources 0.030085 *
## JobRoleLaboratory Technician 0.031861 *
## JobRoleManager 0.002375 **
## JobRoleManufacturing Director 0.207659
## JobRoleResearch Director 3.73e-06 ***
## JobRoleResearch Scientist 0.031207 *
## JobRoleSales Executive 0.068199 .
## JobRoleSales Representative 0.045145 *
## TotalWorkingYears 2.47e-09 ***
## JobLevel2 0.000116 ***
## JobLevel3 2.23e-11 ***
## JobLevel4 3.98e-16 ***
## JobLevel5 < 2e-16 ***
## YearsAtCompany 0.619134
## JobRoleHuman Resources:TotalWorkingYears 0.037417 *
## JobRoleLaboratory Technician:TotalWorkingYears 0.000838 ***
## JobRoleManager:TotalWorkingYears 0.777675
## JobRoleManufacturing Director:TotalWorkingYears 0.646881
## JobRoleResearch Director:TotalWorkingYears 0.553790
## JobRoleResearch Scientist:TotalWorkingYears 0.000205 ***
## JobRoleSales Executive:TotalWorkingYears 0.095745 .
## JobRoleSales Representative:TotalWorkingYears 0.016607 *
## JobRoleHuman Resources:JobLevel2 0.141765
## JobRoleLaboratory Technician:JobLevel2 0.012558 *
## JobRoleManager:JobLevel2 NA
## JobRoleManufacturing Director:JobLevel2 0.181384
## JobRoleResearch Director:JobLevel2 NA
## JobRoleResearch Scientist:JobLevel2 0.056747 .
## JobRoleSales Executive:JobLevel2 0.091730 .
## JobRoleSales Representative:JobLevel2 0.065008 .
## JobRoleHuman Resources:JobLevel3 0.240839
## JobRoleLaboratory Technician:JobLevel3 0.003107 **
## JobRoleManager:JobLevel3 0.473682
## JobRoleManufacturing Director:JobLevel3 0.240048
## JobRoleResearch Director:JobLevel3 0.176552
## JobRoleResearch Scientist:JobLevel3 NA
## JobRoleSales Executive:JobLevel3 0.156543
## JobRoleSales Representative:JobLevel3 NA
## JobRoleHuman Resources:JobLevel4 NA
## JobRoleLaboratory Technician:JobLevel4 NA
## JobRoleManager:JobLevel4 0.564372
## JobRoleManufacturing Director:JobLevel4 NA
## JobRoleResearch Director:JobLevel4 NA
## JobRoleResearch Scientist:JobLevel4 NA
## JobRoleSales Executive:JobLevel4 NA
## JobRoleSales Representative:JobLevel4 NA
## JobRoleHuman Resources:JobLevel5 NA
## JobRoleLaboratory Technician:JobLevel5 NA
## JobRoleManager:JobLevel5 NA
## JobRoleManufacturing Director:JobLevel5 NA
## JobRoleResearch Director:JobLevel5 NA
## JobRoleResearch Scientist:JobLevel5 NA
## JobRoleSales Executive:JobLevel5 NA
## JobRoleSales Representative:JobLevel5 NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 970373.4)
##
## Null deviance: 1.8370e+10 on 869 degrees of freedom
## Residual deviance: 8.0638e+08 on 831 degrees of freedom
## AIC: 14502
##
## Number of Fisher Scoring iterations: 2
plot(fit_alt)
## Warning: not plotting observations with leverage one:
## 523
#visualizing JobRole by the numerical variables remaining
ee_final=data.frame(MonthlyIncome=ee$MonthlyIncome,JobLevel=ee$JobLevel,TotalWorkingYears=ee$TotalWorkingYears,YearsInCurrentRole=ee$YearsInCurrentRole,Education=ee$Education,YearsAtCompany=ee$YearsAtCompany,JobRole=ee$JobRole)
str(ee_final)
## 'data.frame': 870 obs. of 7 variables:
## $ MonthlyIncome : int 4403 19626 9362 10422 3760 8793 2127 6694 2220 5063 ...
## $ JobLevel : int 2 5 3 3 1 3 1 2 1 2 ...
## $ TotalWorkingYears : int 8 21 10 14 6 9 7 8 1 8 ...
## $ YearsInCurrentRole: int 2 7 2 10 3 7 2 0 1 2 ...
## $ Education : int 4 3 2 4 1 2 5 4 4 4 ...
## $ YearsAtCompany : int 5 20 2 14 6 9 4 1 1 8 ...
## $ JobRole : Factor w/ 9 levels "Healthcare Representative",..: 8 6 5 8 7 5 7 8 9 1 ...
eegath3=ee_final %>%
as_data_frame() %>%
gather(key = "variable", value = "value",
-MonthlyIncome,-JobRole)
ggplot(eegath3, aes(x = value, y = MonthlyIncome, col=JobRole)) +
geom_point(aes()) +
geom_smooth(method='lm')+
facet_wrap(~variable,scales='free')
## `geom_smooth()` using formula 'y ~ x'
#This is the final model with the lowest AIC
fit3=glm(MonthlyIncome~I(log(JobLevel)^2)+log(JobLevel)+I(TotalWorkingYears^2)+TotalWorkingYears+JobRole*TotalWorkingYears+.-YearsAtCompany-YearsInCurrentRole-Education-JobLevel,data = ee_final)
summary(fit3)
##
## Call:
## glm(formula = MonthlyIncome ~ I(log(JobLevel)^2) + log(JobLevel) +
## I(TotalWorkingYears^2) + TotalWorkingYears + JobRole * TotalWorkingYears +
## . - YearsAtCompany - YearsInCurrentRole - Education - JobLevel,
## data = ee_final)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2919.9 -602.5 -89.9 556.0 4502.9
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 2358.7147 345.9016 6.819
## I(log(JobLevel)^2) 4623.3690 241.9862 19.106
## log(JobLevel) -512.7206 343.3719 -1.493
## I(TotalWorkingYears^2) -3.7071 0.7044 -5.263
## TotalWorkingYears 192.9223 30.9803 6.227
## JobRoleHuman Resources 56.6393 543.0945 0.104
## JobRoleLaboratory Technician 116.7663 358.8957 0.325
## JobRoleManager 2773.8790 656.4915 4.225
## JobRoleManufacturing Director -16.5977 336.3977 -0.049
## JobRoleResearch Director 3354.4335 501.0200 6.695
## JobRoleResearch Scientist -64.5418 361.9981 -0.178
## JobRoleSales Executive 187.8560 299.7262 0.627
## JobRoleSales Representative -171.9492 383.0521 -0.449
## TotalWorkingYears:JobRoleHuman Resources -105.5045 65.6387 -1.607
## TotalWorkingYears:JobRoleLaboratory Technician -123.2451 26.0996 -4.722
## TotalWorkingYears:JobRoleManager 22.3097 28.2718 0.789
## TotalWorkingYears:JobRoleManufacturing Director 14.5206 22.8600 0.635
## TotalWorkingYears:JobRoleResearch Director 3.3896 25.1079 0.135
## TotalWorkingYears:JobRoleResearch Scientist -74.1409 26.4507 -2.803
## TotalWorkingYears:JobRoleSales Executive -14.2372 21.1310 -0.674
## TotalWorkingYears:JobRoleSales Representative -77.1937 36.3547 -2.123
## Pr(>|t|)
## (Intercept) 1.74e-11 ***
## I(log(JobLevel)^2) < 2e-16 ***
## log(JobLevel) 0.13576
## I(TotalWorkingYears^2) 1.80e-07 ***
## TotalWorkingYears 7.46e-10 ***
## JobRoleHuman Resources 0.91696
## JobRoleLaboratory Technician 0.74500
## JobRoleManager 2.64e-05 ***
## JobRoleManufacturing Director 0.96066
## JobRoleResearch Director 3.91e-11 ***
## JobRoleResearch Scientist 0.85854
## JobRoleSales Executive 0.53099
## JobRoleSales Representative 0.65362
## TotalWorkingYears:JobRoleHuman Resources 0.10835
## TotalWorkingYears:JobRoleLaboratory Technician 2.73e-06 ***
## TotalWorkingYears:JobRoleManager 0.43027
## TotalWorkingYears:JobRoleManufacturing Director 0.52547
## TotalWorkingYears:JobRoleResearch Director 0.89264
## TotalWorkingYears:JobRoleResearch Scientist 0.00518 **
## TotalWorkingYears:JobRoleSales Executive 0.50065
## TotalWorkingYears:JobRoleSales Representative 0.03401 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 971831)
##
## Null deviance: 1.8370e+10 on 869 degrees of freedom
## Residual deviance: 8.2508e+08 on 849 degrees of freedom
## AIC: 14486
##
## Number of Fisher Scoring iterations: 2
par(mfrow=c(3,2))
plot(fit3)
RSS <- c(crossprod(fit3$residuals))
MSE <- RSS / length(fit3$residuals)
RMSE <- sqrt(MSE)
RMSE
## [1] 973.8445
#####Predicting the test values
# income_test = read.csv("Raw_Data_Files/CaseStudy2CompSet No Salary.csv",header = TRUE,stringsAsFactors = TRUE)
# ref=ee_final %>% select(-MonthlyIncome)
# ref=cbind(ID=ee$ID,ref)
# income_test=income_test %>% select(all_of(variable.names(ref)))
# str(income_test)
# preds=predict(fit3,income_test)
# preds=unlist(preds, use.names=FALSE)
# income_preds=data.frame(ID=income_test$ID,MonthlyIncome=preds)
# income_preds
# summary(income_preds)
# summary(ee$MonthlyIncome)
#write.csv(income_preds,"Case2PredictionsStewartSalary.csv")
Using multiple linear regression, we were able to predict Monthly Income. To measure the significance of numerical variables to Monthly Income, a t-test was used. To measure the significance of categorical variables to Monthly Income, a Chi-Squared Test was used. Iterating through combinations of possible models, the final model (fit3) was found to have the lowest AIC. The RMSE = 973.8445.
Overall, we evaluated Job Role against Job Satisfaction and Monthly Income. We identified key variables and predicted Attrition using Naive Bayes. We identified key variables and predicted Monthly Income using Multiple Linear Regression.
Next Steps: We need to understand if the submitted test predictions have a fit similar to our model. If so, we can proceed with implementation. We should also evaluate future opportunities to leverage data science within Frito Lay.